Privacy-preserving deep learning is an emerging field in machine learning that aims to mitigate the privacy risks of using deep neural networks. One such risk is training data extraction from language models that have been trained on datasets containing personal and privacy-sensitive information. In our study, we investigate the extent of named entity memorization in fine-tuned BERT models. We use single-label text classification as a representative downstream task and employ three different fine-tuning setups in our experiments, including one with Differential Privacy (DP). We create a large number of text samples from the fine-tuned BERT models utilizing a custom sequential sampling strategy with two prompting strategies. We search these samples for named entities and check whether they are also present in the fine-tuning datasets. We experiment with two benchmark datasets in the domains of emails and blogs. We show that the application of DP has a strong effect on the text generation capabilities of BERT. Furthermore, we show that a fine-tuned BERT does not generate more named entities specific to the fine-tuning dataset than a BERT model that is only pre-trained. This suggests that BERT is unlikely to emit personal or privacy-sensitive named entities. Overall, our results are important for understanding to what extent BERT-based services are prone to training data extraction attacks.
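The extraction check described above can be sketched as follows. This is a minimal illustration assuming a Hugging Face masked language model for sequential sampling and spaCy's en_core_web_sm model (installed separately) for named entity recognition; the checkpoints, prompts, and sampling hyperparameters are placeholders, not the exact setup of the study.

```python
# Minimal sketch of the memorization check: sequentially sample text from a
# (fine-tuned) BERT via masked-token prediction, extract named entities, and
# intersect them with entities found in the fine-tuning corpus.
import torch
import spacy
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()
nlp = spacy.load("en_core_web_sm")

def sample_sequentially(prompt: str, new_tokens: int = 20) -> str:
    """Append one predicted token at a time after the prompt (greedy decoding)."""
    ids = tokenizer(prompt, return_tensors="pt")["input_ids"][0][:-1]  # drop [SEP]
    for _ in range(new_tokens):
        with_mask = torch.cat([ids, torch.tensor([tokenizer.mask_token_id,
                                                  tokenizer.sep_token_id])])
        with torch.no_grad():
            logits = model(with_mask.unsqueeze(0)).logits[0]
        next_id = logits[len(ids)].argmax()       # prediction at the mask position
        ids = torch.cat([ids, next_id.unsqueeze(0)])
    return tokenizer.decode(ids, skip_special_tokens=True)

def named_entities(text: str) -> set[str]:
    return {ent.text.lower() for ent in nlp(text).ents}

# entities that occur both in generated samples and in the fine-tuning data
fine_tuning_entities = named_entities("Alice Smith wrote to Bob in Berlin.")  # placeholder corpus
generated = sample_sequentially("The email was sent by")
leaked = named_entities(generated) & fine_tuning_entities
print(generated, leaked)
```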
Short text classification is a crucial and challenging aspect of Natural Language Processing. For this reason, there are numerous highly specialized short text classifiers. However, recent short text research has left State of the Art (SOTA) methods for traditional text classification, particularly the pure use of Transformers, unexploited. In this work, we examine the performance of a variety of short text classifiers as well as the top-performing traditional text classifier. We further investigate their performance on two new real-world short text datasets in an effort to address the issue of becoming overly dependent on benchmark datasets with a limited number of characteristics. Our experiments unambiguously demonstrate that Transformers achieve SOTA accuracy on short text classification tasks, raising the question of whether specialized short text techniques are necessary.
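A minimal sketch of the "pure Transformer" baseline for short text classification, assuming a BERT checkpoint and a toy two-example dataset; the dataset, checkpoint, and hyperparameters are illustrative only and not the paper's experimental setup.

```python
# Fine-tune a Transformer for (short) text classification with the Hugging Face Trainer.
from datasets import Dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased",
                                                           num_labels=2)

# placeholder short texts; in practice these would be, e.g., tweets or news titles
data = Dataset.from_dict({"text": ["great phone", "battery died fast"],
                          "label": [1, 0]})
data = data.map(lambda x: tokenizer(x["text"], truncation=True,
                                    padding="max_length", max_length=32),
                batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="out", num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=data,
)
trainer.train()
```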
Annotation of multimedia data by humans is time-consuming and costly, while reliable automatic generation of semantic metadata is a major challenge. We propose a framework to extract semantic metadata from automatically generated video captions. As metadata, we consider entities, the entities' properties, relations between entities, and the video category. We employ two state-of-the-art dense video captioning models with masked transformer (MT) and parallel decoding (PVDC) to generate captions for videos of the ActivityNet Captions dataset. Our experiments show that it is possible to extract entities, their properties, relations between entities, and the video category from the generated captions. We observe that the quality of the extracted information is mainly influenced by the quality of the event localization in the video as well as the performance of the event caption generation.
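A minimal sketch of mining such metadata from a generated caption, assuming spaCy's dependency parser as the extraction component; the actual framework may use different NLP tools and relation rules.

```python
# Extract entities, adjectival properties, and subject-verb-object relations from a caption.
import spacy

nlp = spacy.load("en_core_web_sm")
caption = "A young woman is playing a guitar on a stage."
doc = nlp(caption)

entities = [chunk.text for chunk in doc.noun_chunks]            # e.g. "A young woman"
properties = {chunk.root.text: [tok.text for tok in chunk if tok.pos_ == "ADJ"]
              for chunk in doc.noun_chunks}                      # e.g. woman -> ["young"]
relations = [(tok.head.text,                                     # verb
              [c.text for c in tok.head.children if c.dep_ == "nsubj"],
              [c.text for c in tok.head.children if c.dep_ == "dobj"])
             for tok in doc if tok.dep_ == "dobj"]               # e.g. (playing, [woman], [guitar])
print(entities, properties, relations)
```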
The goal of graph summarization is to represent large graphs in a structured and compact way. A graph summary based on equivalence classes preserves pre-defined features of a vertex's $k$-hop neighborhood, such as the vertex labels and edge labels. Based on these neighborhood characteristics, the vertex is assigned to an equivalence class. The calculation of the assigned equivalence class must be a permutation-invariant operation on the pre-defined features. This is achieved by sorting the feature values, e.g., the edge labels, which is computationally expensive, and subsequently hashing the result. Graph Neural Networks (GNNs) fulfill the permutation invariance requirement. We formulate the problem of graph summarization as a subgraph classification task on the root vertex of the $k$-hop neighborhood. We adapt different GNN architectures, both based on the popular message-passing protocol and alternative approaches, to perform the structural graph summarization task. We compare the different GNNs with a standard multi-layer perceptron (MLP) and a Bloom filter as non-neural baselines. For our experiments, we consider four popular graph summary models on a large web graph. This yields challenging multi-class vertex classification tasks with the number of classes ranging from $576$ to multiple hundreds of thousands. Our results show that the performance of the GNNs is close to each other. In three out of four experiments, the non-message-passing GraphMLP model outperforms the other GNNs. The performance of the standard MLP is extraordinarily good, especially in the presence of many classes. Finally, the Bloom filter outperforms all neural architectures by a large margin, except on the dataset with the fewest classes ($576$).
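A minimal sketch of the sort-and-hash equivalence class assignment for a 1-hop neighborhood, assuming a toy labeled edge list; the concrete summary models (and the features they preserve) differ from this simplification.

```python
# Hash-based graph summarization for a 1-hop neighborhood: the equivalence class of a
# vertex is a hash over its sorted neighborhood features (here: edge label plus
# neighbor label), which makes the operation permutation invariant.
import hashlib
from collections import defaultdict

# toy labeled graph: (source, edge_label, target), plus vertex labels
edges = [("v1", "knows", "v2"), ("v1", "likes", "v3"), ("v2", "knows", "v1")]
vertex_labels = {"v1": "Person", "v2": "Person", "v3": "Movie"}

def equivalence_class(vertex: str) -> str:
    features = sorted((el, vertex_labels[t]) for s, el, t in edges if s == vertex)
    return hashlib.sha256(repr(features).encode()).hexdigest()[:12]

summary = defaultdict(list)
for v in vertex_labels:
    summary[equivalence_class(v)].append(v)
print(dict(summary))  # vertices grouped by their 1-hop equivalence class
```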
Large-scale graph data in the real world is often dynamic rather than static. The data changes over time, with new nodes, edges, and even classes appearing, e.g., in citation networks and R&D collaboration networks. Graph Neural Networks (GNNs) have become the standard method for numerous tasks on graph-structured data. In this work, we employ a two-step procedure to explore how GNNs can be incrementally adapted to new, unseen graph data. First, we analyze the margin between transductive and inductive learning on standard benchmark datasets. After inductive prediction, we add the unlabeled data to the graph and show that the models remain stable. Then, we explore the case of continually adding more and more labeled data, while also considering cases in which not all class labels are available. Furthermore, we introduce new classes while the graph evolves and explore methods to automatically detect instances from previously unseen classes. In order to deal with evolving graphs in a principled way, we propose a lifelong learning framework for graph data along with an evaluation protocol. In this framework, we evaluate representative GNN architectures. We observe that implicit knowledge within model parameters becomes more important when explicit knowledge, i.e., data from past tasks, is limited. We find that in open-world node classification, the data of surprisingly few past tasks is sufficient to reach the performance attained by remembering data from all past tasks. In the challenging task of unseen class detection, we find that using a weighted cross-entropy loss is important for stability.
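A minimal sketch of a weighted cross-entropy loss as mentioned for unseen class detection, assuming PyTorch and class weights inversely proportional to class frequency; this weighting scheme is one common choice and an assumption, not necessarily the exact scheme used in the framework.

```python
# Weighted cross-entropy for node classification with class weights inversely
# proportional to class frequency, so that rare (e.g., newly appearing) classes
# contribute more to the loss.
import torch
import torch.nn.functional as F

num_classes = 4
labels = torch.tensor([0, 0, 0, 1, 1, 2, 3])                      # labels of the current task
logits = torch.randn(len(labels), num_classes, requires_grad=True)  # stand-in for GNN output

counts = torch.bincount(labels, minlength=num_classes).float()
weights = counts.sum() / (num_classes * counts.clamp(min=1))      # rare classes get larger weight

loss = F.cross_entropy(logits, labels, weight=weights)
loss.backward()  # in practice, backpropagated through the GNN
print(weights, loss.item())
```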
Pre-trained language models (PreLMs) are revolutionizing natural language processing across all benchmarks. However, their enormous size is prohibitive for small laboratories or for deployment on mobile devices. Approaches like pruning and distillation reduce the model size but typically retain the same model architecture. In contrast, we explore distilling PreLMs into a more efficient architecture, the continual multiplication of words (CMOW), which embeds each word as a matrix and uses matrix multiplication to encode sequences. We extend the CMOW architecture and its CMOW/CBOW-Hybrid variant with a bidirectional component for more expressive power, one-shot representations for a general (task-agnostic) distillation during pre-training, and a two-sequence encoding scheme that facilitates downstream tasks on sentence pairs, such as sentence similarity and natural language inference. Our matrix-based bidirectional CMOW/CBOW-Hybrid model is competitive with DistilBERT on question similarity and recognizing textual entailment, but uses only half the number of parameters and is three times faster in terms of inference speed. We match or exceed the scores of ELMo, except on the sentiment analysis task SST-2 and the linguistic acceptability task CoLA. However, compared to previous cross-architecture distillation approaches, we demonstrate a doubling of the scores on detecting linguistic acceptability. This shows that matrix-based embeddings can be used to distill large PreLMs into competitive models and motivates further research in this direction.
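A minimal sketch of the CMOW idea described above: each word is embedded as a square matrix and a sequence is encoded by multiplying the word matrices in order. The dimensions, vocabulary, and near-identity initialization here are illustrative assumptions, not the trained model.

```python
# CMOW-style sequence encoding: word matrices composed by (non-commutative) matrix
# multiplication, so the encoding is sensitive to word order.
import torch

d, vocab_size = 4, 10
# word-matrix embeddings, initialized near the identity for illustration
word_matrices = torch.eye(d).repeat(vocab_size, 1, 1) + 0.01 * torch.randn(vocab_size, d, d)

def encode(token_ids: list[int]) -> torch.Tensor:
    """Encode a token sequence as the ordered product of its word matrices."""
    enc = torch.eye(d)
    for t in token_ids:
        enc = enc @ word_matrices[t]
    return enc.flatten()  # the flattened matrix serves as the sequence embedding

print(encode([1, 2, 3]))
print(encode([3, 2, 1]))  # differs from the above: the encoding is order sensitive
```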
Modern multi-document summarization (MDS) methods are based on transformer architectures. They generate state-of-the-art summaries but lack explainability. We focus on graph-based transformer models for MDS, as they have recently gained popularity. We aim to improve the explainability of graph-based MDS by analyzing their attention weights. In a graph-based MDS model such as GraphSum, vertices represent the textual units, while the edges form a similarity graph over the units. We compare GraphSum's performance using different textual units, i.e., sentences versus paragraphs, on two news benchmark datasets, namely WikiSum and MultiNews. Our experiments show that paragraph-level representations provide the best summarization performance. Thus, we subsequently focus on analyzing the paragraph-level attention weights of GraphSum's multi-heads and decoding layers in order to improve the explainability of a transformer-based MDS model. As a reference metric, we calculate ROUGE scores between the input paragraphs and each sentence in the generated summary, which indicate source origin information via text similarity. We observe a high correlation between the attention weights and this reference metric, especially in the later decoding layers of the transformer architecture. Finally, we investigate whether the generated summaries follow a pattern of positional bias by extracting which paragraph provided the most information for each generated summary. Our results show that there is a high correlation between the position in the summary and the source origin.
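A minimal sketch of the reference metric: ROUGE between each input paragraph and a generated summary sentence, correlated with the decoder's attention over paragraphs. It assumes the rouge-score package and a hand-written attention vector; the study analyzes GraphSum's actual multi-head attention weights per decoding layer.

```python
# Compare paragraph-level ROUGE against attention weights over paragraphs.
from rouge_score import rouge_scorer
from scipy.stats import spearmanr

paragraphs = ["The storm hit the coast on Monday.",
              "Officials ordered evacuations in two counties.",
              "A local bakery won a national prize."]
summary_sentence = "Evacuations were ordered after the storm hit the coast."

scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=True)
rouge_per_paragraph = [scorer.score(p, summary_sentence)["rouge1"].fmeasure
                       for p in paragraphs]

attention_over_paragraphs = [0.45, 0.40, 0.15]  # stand-in for averaged attention weights
corr, _ = spearmanr(rouge_per_paragraph, attention_over_paragraphs)
print(rouge_per_paragraph, corr)
```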
Traditional approaches for data anonymization consider relational data and textual data independently. We propose rx-anon, an anonymization approach for heterogeneous semi-structured documents composed of relational and textual attributes. We map sensitive terms extracted from the text to the structured data. This allows us to use concepts like k-anonymity to generate a joined, privacy-preserved version of the heterogeneous data input. We introduce the concept of redundant sensitive information to consistently anonymize the heterogeneous data. To control the influence of anonymization over unstructured textual data versus structured data attributes, we introduce a modified, parameterized Mondrian algorithm. The parameter $\lambda$ allows giving different weights to the relational and textual attributes during the anonymization process. We evaluate our approach on two real-world datasets using a Normalized Certainty Penalty score, adapted to the problem of jointly anonymizing relational and textual data. The results show that our approach is capable of reducing information loss by using the tuning parameter to control the Mondrian partitioning, while guaranteeing k-anonymity for relational attributes as well as for sensitive terms. As rx-anon is a framework approach, it can be reused and extended by other anonymization algorithms, privacy models, and textual similarity metrics.
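A minimal sketch of how a parameter $\lambda$ could weight relational against textual attributes when a Mondrian-style partitioner picks its next split attribute. The span-based scoring and the attribute names below are simplifying assumptions for illustration, not the actual rx-anon algorithm.

```python
# Lambda-weighted split choice: relational attributes are scored with weight (1 - lambda),
# the textual (sensitive-term) attribute with weight lambda.
from collections import Counter

records = [
    {"age": 34, "zip": "10115", "terms": {"Alice", "Berlin"}},
    {"age": 29, "zip": "10245", "terms": {"Bob", "Berlin"}},
    {"age": 41, "zip": "10115", "terms": {"Carol", "Hamburg"}},
]

def span(values):
    """Normalized diversity of an attribute: value range for numbers, distinct count otherwise."""
    if all(isinstance(v, (int, float)) for v in values):
        return (max(values) - min(values)) / (max(values) or 1)
    return len(set(values)) / len(values)

def choose_split_attribute(records, lam: float = 0.5) -> str:
    scores = {}
    for attr in ("age", "zip"):                        # relational attributes
        scores[attr] = (1 - lam) * span([r[attr] for r in records])
    term_counts = Counter(t for r in records for t in r["terms"])
    scores["terms"] = lam * (len(term_counts) / sum(term_counts.values()))  # textual attribute
    return max(scores, key=scores.get)

print(choose_split_attribute(records, lam=0.2))  # low lambda favors relational splits
print(choose_split_attribute(records, lam=0.9))  # high lambda favors the textual attribute
```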
Graph Neural Networks (GNNs) have become the standard method for numerous tasks on graph-structured data, such as node classification. However, real-world graphs often evolve over time, and new classes may even appear. We frame these challenges as instances of lifelong learning, in which a learner faces a sequence of tasks and may take over knowledge acquired in past tasks. This knowledge can be stored explicitly as historical data or implicitly within the model parameters. In this work, we systematically analyze the influence of implicit and explicit knowledge. To this end, we propose an incremental training method for lifelong learning on graphs and introduce a new measure based on $k$-neighborhood time differences to address variances in the historical data. We apply our training method to five representative GNN architectures and evaluate them on three new lifelong node classification datasets. Our results show that GNNs whose receptive field covers no more than 50% of the graph data's history retain at least 95% of the accuracy achieved when training on the complete history. Furthermore, our experiments confirm that implicit knowledge becomes more important when less explicit knowledge is available.
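A minimal sketch of incremental training with limited explicit knowledge: for each new task, the model is warm-started from the previous task and trained only on a window of the most recent tasks' data. The model, data, and window size are placeholders for a GNN, graph snapshots, and the history limit; the actual framework and its time-difference measure differ in detail.

```python
# Incremental training over a task sequence with a bounded history window.
import torch
import torch.nn as nn

tasks = [  # toy per-task data: (features, labels)
    (torch.randn(20, 8), torch.randint(0, 3, (20,))) for _ in range(5)
]
model = nn.Linear(8, 3)          # stand-in for a GNN
optimizer = torch.optim.Adam(model.parameters(), lr=0.01)
window = 2                        # keep explicit knowledge of the last 2 tasks only

for t in range(len(tasks)):
    history = tasks[max(0, t - window + 1): t + 1]
    x = torch.cat([h[0] for h in history])
    y = torch.cat([h[1] for h in history])
    for _ in range(10):           # a few epochs per task, warm-starting from previous weights
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    print(f"task {t}: trained on {len(history)} task(s), loss {loss.item():.3f}")
```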
To apply federated learning to drug discovery, we developed a novel platform in the context of the European Innovative Medicines Initiative (IMI) project MELLODDY (grant no. 831472), which comprised 10 pharmaceutical companies, academic research labs, large industrial companies, and startups. The MELLODDY platform was the first industry-scale platform to enable the creation of a global federated model for drug discovery without sharing the confidential datasets of the individual partners. The federated model was trained on the platform by aggregating the gradients of all contributing partners in a cryptographically secure way after each training iteration. The platform was deployed on an Amazon Web Services (AWS) multi-account architecture running Kubernetes clusters in private subnets. Organisationally, the roles of the different partners were codified as different rights and permissions on the platform and administered in a decentralized way. The MELLODDY platform generated new scientific discoveries which are described in a companion paper.
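A minimal sketch of federated gradient aggregation in the FedSGD style: each partner computes gradients on its private data, only the gradients are shared, and the server averages them into a global update. The plain average below is an illustrative simplification; the MELLODDY platform additionally performs the aggregation in a cryptographically secure way.

```python
# Federated gradient averaging over simulated partners (FedSGD-style sketch).
import torch
import torch.nn as nn

global_model = nn.Linear(16, 1)   # stand-in for the shared prediction model

# each partner's private data never leaves its site; only gradients are exchanged
partner_data = [(torch.randn(32, 16), torch.randn(32, 1)) for _ in range(3)]

def local_gradients(model, x, y):
    model.zero_grad()
    nn.functional.mse_loss(model(x), y).backward()
    return [p.grad.clone() for p in model.parameters()]

for round_ in range(5):
    grads_per_partner = [local_gradients(global_model, x, y) for x, y in partner_data]
    # server: average the partners' gradients and apply one global SGD step
    with torch.no_grad():
        for param, grads in zip(global_model.parameters(), zip(*grads_per_partner)):
            param -= 0.1 * torch.stack(grads).mean(dim=0)
```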